arXiv:1501.05152v1 [cs.CV] 21 Jan 2015


Mirror, mirror on the wall, tell me, is the error small? 


Heng Yang 

Queen Mary University of London 

heng.yang@qmul.ac.uk 


Abstract 

Do object part localization methods produce bilaterally symmetric results
on mirror images? Surprisingly not, even though state of the art methods
augment the training set with mirrored images. In this paper we take a
closer look into this issue. We first introduce the concept of
mirrorability as the ability of a model to produce symmetric results in
mirrored images and introduce a corresponding measure, namely the mirror
error, that is defined as the difference between the detection result on
an image and the mirror of the detection result on its mirror image. We
evaluate the mirrorability of several state of the art algorithms in two
of the most intensively studied problems, namely human pose estimation and
face alignment. Our experiments lead to several interesting findings:
1) surprisingly, most state of the art methods struggle to preserve the
mirror symmetry, despite the fact that they have very similar overall
performance on the original and mirror images; 2) the low mirrorability is
not caused by training or testing sample bias, since all algorithms are
trained on both the original images and their mirrored versions; 3) the
mirror error is strongly correlated with the localization/alignment error
(with correlation coefficients around 0.7). Since the mirror error is
calculated without knowledge of the ground truth, we show two interesting
applications: in the first it is used to guide the selection of difficult
samples and in the second to give feedback in a popular Cascaded Pose
Regression method for face alignment.

1. Introduction 

The evolution of mirror (bilateral) symmetry has profoundly impacted
animal evolution [7]. As a consequence, the overwhelming majority of
modern animals (>99%), including humans, exhibit mirror symmetry. As shown
in Fig. 1, the mirror of an image depicting such objects shows a
meaningful version of the same objects. Taking face images as a concrete
example, a mirrored version of a face image is perceived as the same face.
In recent years, object (part) localization has made significant progress
and several methods have reported close-to-human performance. This
includes localization of objects in images (e.g. pedestrian or face
detection) and fine-grained localization of object parts (e.g. face parts
localization, body parts localization, bird parts localization). Most of
those methods augment the training set by mirroring the positive training
samples. However, are these models able to give symmetric results on a
mirror image during testing?

(a) Mirror error 0.2. (b) Mirror error 0.02.
(c) Mirror error 0.6. (d) Mirror error 0.02.

Figure 1: Example pairs of localization results on original (left) and
mirror (right) images. First row: Human Pose Estimation [24], second row:
Face Alignment by RCPR [4]. The first column (a and c) shows large mirror
error and the second (b and d) small mirror error. Can we evaluate the
performance without knowing the ground truth?

In order to answer this question we first introduce the concept of
mirrorability, i.e., the ability of an algorithm to give bilaterally
symmetric results on a mirror image, and a quantitative measure called the
mirror error. The latter is defined as the difference between the
detection result on an image and the mirror of the detection result on its
mirror image. We evaluate the mirrorability of several state of the art
algorithms in two representative problems (face alignment and human pose
estimation) on several datasets. One would expect a model that has been
trained on a dataset augmented with mirror images to give similar results
on an image and its mirrored version. However, as can be seen in the first
column of Fig. 1, several state of the art methods in their corresponding
problems sometimes struggle to give symmetric results on the mirror
images, and for some samples the mirror error is quite large. By looking
at the mirrorability of different approaches in human pose estimation and
face alignment, we arrive at three interesting findings. First, most of
the models struggle to preserve mirrorability: the mirror error is present
and sometimes significant. Second, the low mirrorability is not likely to
be caused by training or testing sample bias, since the training sets are
augmented with mirrored images. Third, the mirror error of the samples is
highly correlated with the corresponding ground truth error.

This last finding is significant since one of the nice properties of the
proposed mirror error is that it is calculated 'blindly', i.e., without
using the ground truth. We rely on this property in order to show two
examples of how it could be used in practice. In the first one the mirror
error is used as a guide for the selection of difficult samples in
unlabelled data and in the second one it is used to provide feedback on a
cascaded pose regression method for face alignment. In the former
application, the samples selected based on the mirror error have shown
high consistency across different methods and high consistency with the
difficult samples selected based on the ground truth alignment error. In
the latter application, the feedback mechanism is used in a multiple
initializations scheme in order to detect failures; this leads to large
improvements and state of the art results in face alignment.

To summarize, in this paper we make the following contributions:

• To the best of our knowledge, we are the first to look into the mirror
symmetric performance of object part localization models.

• We introduce the concept of mirrorability and show how the corresponding
measure that we propose, called mirror error, can be used in evaluating
general object part localization methods.

• We evaluate the mirrorability of several algorithms in two domains (i.e.
face alignment and body part localization) and report several interesting
findings on the mirrorability.

• We show two applications of the mirrorability in the domain of face
alignment.

2. Mirrorability in Object Part Localization 
2.1. Mirrorability concepts and definitions 

We define mirrorability as the ability of a model/algorithm to preserve
the mirror symmetry when applied on an image and its mirror image. In
order to quantify it we introduce a measure called mirror error that is
defined as the difference between a detection result on an image and the
mirror of the result on its mirror image. Specifically, let us denote the
shape of an object, for example a human or a face, by a set of $K$ points,
$X = \{\mathbf{x}_k\}_{k=1}^{K}$, where $\mathbf{x}_k$ are the coordinates
of the $k$-th point/part. The detection result on the original image is
denoted by ${}^{q}X = \{{}^{q}\mathbf{x}_k\}_{k=1}^{K}$ and the detection
result on the mirror image is denoted by
${}^{p}X = \{{}^{p}\mathbf{x}_k\}_{k=1}^{K}$. The mirror transformation of
${}^{p}X$ to the original image is denoted by
${}^{p \to q}X = \{{}^{p \to q}\mathbf{x}_k\}_{k=1}^{K}$, where
${}^{p \to q}\mathbf{x}_k$ denotes the mirror result of the $k$-th part on
the original image. Generally, a different index $k'$ is used on the
mirror image (e.g. a left eye in an image becomes a right eye in the
mirror image). Therefore, the transformation consists of the image
coordinate transform and the part index mirror transform ($k' \to k$).

The image coordinate transform is applied on the horizontal coordinate,
that is ${}^{p \to q}x_k = w_I - {}^{p}x_{k'}$, where $w_I$ is the width of
the image $I$ and ${}^{p}x_{k'}$ is the $x$ coordinate of the $k'$-th point
in the mirror image. The index re-assignment is based on the mirror
symmetric structure of a specific object, with a one-to-one mapping list
where, for example, the left eye index is mapped to the right eye index.
Formally, the mirror error of the $k$-th landmark (body joint or facial
point) is defined as
$\|{}^{q}\mathbf{x}_k - {}^{p \to q}\mathbf{x}_k\|$, and the sample-wise
mirror error as:

$$e_m = \frac{1}{K} \sum_{k=1}^{K} \left\| {}^{q}\mathbf{x}_k - {}^{p \to q}\mathbf{x}_k \right\| \tag{1}$$

The mirror error that is defined in the above equation has the following
properties. First, a high mirror error reflects low mirrorability and vice
versa. Second, it is symmetric, i.e., given a pair of mirror images it
makes no difference which is considered to be the original. Third, and
importantly, calculating the mirror error does not require ground truth
information.
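
The definitions above can be illustrated with a minimal Python sketch. It assumes NumPy, points stored as (K, 2) arrays of (x, y) coordinates, and the coordinate flip $w_I - x$ used in the text; the function names and the three-point toy face with its index mapping are hypothetical and serve only as an example, not as the implementation used in the experiments.

```python
import numpy as np

def mirror_to_original(pts_mirror, image_width, mirror_index):
    """Map detections made on the mirror image back to the original frame.

    pts_mirror   : (K, 2) array of (x, y) detections on the mirror image.
    image_width  : width w_I of the image.
    mirror_index : length-K integer sequence; mirror_index[k] is the index k'
                   on the mirror image that corresponds to part k on the
                   original image (e.g. left eye <-> right eye).
    """
    # Index re-assignment (integer indexing returns a new array).
    pts = np.asarray(pts_mirror, dtype=float)[np.asarray(mirror_index)]
    # Coordinate transform on the horizontal axis only.
    pts[:, 0] = image_width - pts[:, 0]
    return pts

def mirror_error(pts_original, pts_mirror, image_width, mirror_index, size=1.0):
    """Sample-wise mirror error: mean point distance, divided by object size."""
    mapped = mirror_to_original(pts_mirror, image_width, mirror_index)
    diff = np.asarray(pts_original, dtype=float) - mapped
    return np.linalg.norm(diff, axis=1).mean() / size

# Toy example (assumed): parts are [left eye, right eye, nose].
index_map = [1, 0, 2]
on_original = [[30, 40], [70, 40], [50, 60]]
on_mirror = [[30, 40], [73, 44], [50, 60]]  # one point is off by (3, 4)
print(mirror_error(on_original, on_mirror, 100, index_map))  # 5/3
```

Because the index mapping is its own inverse, swapping the roles of the two images yields the same value, which matches the symmetry property stated above.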

In a similar way we calculate the ground truth localization error as the
difference between the detected locations and the ground truth locations
of the facial landmarks or the human body joints. In order to be
consistent and to distinguish it from the mirror error we call it the
alignment error. Formally,

$$e_a = \frac{1}{K} \sum_{k=1}^{K} \left\| {}^{q}\mathbf{x}_k - {}^{g}\mathbf{x}_k \right\| \tag{2}$$

where ${}^{g}\mathbf{x}_k$ is the ground truth location of the $k$-th
point. In a similar way, we define the alignment error ${}^{p}e_a$ on the
mirror image of the test sample. For simplicity, in what follows the term
alignment error $e_a$ refers to the alignment error in the original image.


Both Eq. 1 and Eq. 2 are absolute errors. In order to keep our analysis
invariant to the size of the object in each image, we normalize them by
the object size $s$, i.e., the size of the body or the face. The size of
the human body and of the face are calculated in different ways, which are
described where they are used.

2.2. Human pose estimation 

Experiment setting. In order to evaluate the mirrorability of algorithms
for human pose estimation, we focus on two representative methods, namely
the Flexible Mixtures of Parts (FMP) method by Yang and Ramanan [24] and
the Latent Tree Models (LTM) by Wang and Li [20]. FMP is generally
regarded as a benchmark method for human pose estimation and most of the
recent methods are improved versions or variants of it. The method of Wang
and Li [20] introduced latent variables in tree model learning, which led
to improvements. Both provide source code, which we used in our
evaluation. Since it is not our main focus to improve the performance in a
specific domain, we use popular state of the art approaches and evaluate
them on standard datasets. We use three widely used datasets, namely the
Leeds Sport Dataset (LSP), the Image Parse dataset [13] and the Buffy
Stickmen dataset [6]. We use the default training/test split of the
datasets. The number of test images on LSP, Parse and Buffy is 1000, 276
and 205 respectively. We trained both the FMP and LTM models on LSP and
only the FMP model on Parse and Buffy. We emphasize that the training
dataset is augmented with mirror images, which eliminates the training
sample bias.

Overall performance difference. We first compare the overall performance
on the original test set and on the mirror set. We use the evaluation
criterion proposed in [24] and also recommended in [ ], namely the
Percentage of Correct Keypoints (PCK). In order to calculate the PCK, for
each person a tightly-cropped bounding box is generated as the tightest
box around the person in question that contains all of the ground truth
keypoints. The size of the person is calculated as $s = \max(h, w)$, where
$h$ and $w$ are the height and width of the bounding box. This is used to
normalize the absolute mirror error in Eq. 1 and the alignment error in
Eq. 2. The results on Buffy, Parse and LSP are shown in Table 1, Table 2
and Table 3 respectively. As can be seen, there is no significant overall
difference between the detection results on the original images and on
their mirror images. The maximum difference across methods and datasets is
around 1%, while the average difference is less than 1%.
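
As a rough illustration of the PCK criterion described above, the following Python sketch computes the person size from the ground truth keypoints and the percentage of correct keypoints per part. It assumes NumPy and arrays of shape (N, K, 2); the function names and the two-person toy data are hypothetical, and the threshold factor `alpha` corresponds to the 0.1 or 0.2 quoted in the table captions.

```python
import numpy as np

def person_size(gt_points):
    """s = max(h, w) of the tightest box containing all ground truth keypoints."""
    gt = np.asarray(gt_points, dtype=float)
    w, h = gt.max(axis=0) - gt.min(axis=0)  # extent along x and along y
    return max(h, w)

def pck(pred_points, gt_points, alpha=0.1):
    """Percentage of Correct Keypoints for each of the K parts.

    pred_points, gt_points : arrays of shape (N, K, 2).
    A keypoint counts as correct if its error is below alpha * max(h, w).
    """
    pred = np.asarray(pred_points, dtype=float)
    gt = np.asarray(gt_points, dtype=float)
    sizes = np.array([person_size(g) for g in gt])   # shape (N,)
    dist = np.linalg.norm(pred - gt, axis=2)          # shape (N, K)
    correct = dist < alpha * sizes[:, None]
    return 100.0 * correct.mean(axis=0)

# Toy data (assumed): two persons, two keypoints each.
gt = [[[0, 0], [10, 20]], [[0, 0], [40, 10]]]
pred = [[[1, 0], [10, 23]], [[3, 0], [40, 13]]]
print(pck(pred, gt, alpha=0.1))  # first keypoint 100%, second 50%
```

Running the same function on the detections mapped back from the mirror images gives the "Mirror" rows of such a comparison.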

Mirrorability. The fact that the average performance on mirror images is
similar to the average performance on the originals might be the root of
the common belief that models produce more or less bilaterally symmetrical
results. A


Points     Head  Shou  Elbo  Wris  Hip   Avg
Original   96.9  97.3  91.1  80.8  79.6  89.1
Mirror     97.1  98.4  91.8  81.9  80.4  89.9

Table 1: PCK of FMP [24] on Buffy. A point is correct if the error is less
than 0.2 * max(h, w).


Points     Head  Shou  Elbo  Wris  Hip   Knee  Ankle  Avg
Original   90.0  85.6  68.3  47.3  77.3  75.6  67.3   73.1
Mirror     90.0  86.1  67.6  46.3  76.8  74.6  68.5   72.8

Table 2: PCK of FMP [24] on Parse. A point is correct if the error is less
than 0.1 * max(h, w).

Points         Head  Shou  Elbo  Wris  Hip   Knee  Ankle  Avg
FMP Original   81.2  61.1  45.5  33.4  63.0  55.6  49.5   55.6
FMP Mirror     82.2  61.0  44.9  33.8  63.7  56.1  50.5   56.0
LTM Original   88.5  66.0  51.3  41.1  69.7  59.2  55.6   61.6
LTM Mirror     88.7  65.8  51.4  40.7  70.2  58.0  55.0   61.4

Table 3: PCK of FMP [24] and LTM [20] on LSP. A point is correct if the
error is less than 0.1 * max(h, w).




(a) Yang and Ramanan [24] (b) Wang and Li [20]

Figure 2: Visualization of the mirror error (upper numbers) and the
alignment error (lower numbers) of body joints. The values are percentages
of the body size. The radius of each ellipse represents one standard
deviation of the mirror error on the corresponding body joint.

closer inspection, however, reveals that this is not true. Let us first
visualize the mirror error of individual body joints, i.e., the
landmark-wise mirror error of Section 2.1, of both FMP and LTM on the LSP
dataset. In Fig. 2 we plot the mirror error (normalized by the body size
in the example image) of the 1000 test images on each individual joint. As
can be seen, there is a difference, which in some cases is quite large,
for example on the elbows, feet and especially on the wrists (about 18%
for FMP and about 20% for LTM). This result directly challenges the
perception that the models give mirror symmetrical results. We reiterate
that this is despite the fact that the overall performance is similar on
the original and the mirror images and despite the fact that we have
augmented the training set with the mirror images. This leads us to the
conclusion that the low mirrorability (i.e. large mirror error) is not the
result of sample bias.

Figure 3: Mirror error and alignment error on LSP of LTM [20]. The x axis
shows the image indexes after sorting the alignment error in ascending
order. Two example images and their mirror images are shown, one with
small mirror error and the other with large mirror error.

It is interesting to observe in Fig. 2 that the joints with large average
mirror error are usually the most challenging to localize, that is they
are the ones with the higher alignment error. This seems to indicate that
there is a correlation between the mirror error and the alignment error.
In Fig. 3, as an example, we show the mirror error vs. the sorted
sample-wise alignment error of LTM on the LSP dataset. It is clear that
the mirror error tends to increase as the image alignment error increases.
Two examples of pairs of images are shown in Fig. 3, and the correlation
between the sample-wise mirror error and the alignment error is shown in
Fig. 4. On all three datasets the mirror error shows a strong correlation
with the alignment error. For the smaller datasets, Buffy and Parse, the
correlation coefficient is around 0.6. On the larger LSP dataset, the
correlation coefficient of both LTM and FMP is around 0.7. We can conclude
that although the mirror error is calculated without knowledge of the
ground truth, it is informative of the real alignment error in each
sample.
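
A correlation coefficient of this kind can be computed from the two per-sample error vectors as sketched below. The sketch assumes that the Pearson coefficient is meant (the text only says "correlation coefficient"), uses NumPy's `corrcoef`, and runs on a few made-up error values; the function name and the numbers are illustrative only.

```python
import numpy as np

def error_correlation(mirror_errors, alignment_errors):
    """Pearson correlation between sample-wise mirror and alignment errors."""
    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
    # entry is the coefficient between the two input vectors.
    return np.corrcoef(mirror_errors, alignment_errors)[0, 1]

# Made-up normalized errors for five samples (not results from the paper).
e_m = [0.02, 0.05, 0.04, 0.15, 0.30]
e_a = [0.03, 0.04, 0.06, 0.10, 0.25]
print(error_correlation(e_m, e_a))
```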

2.3. Face alignment 

Face alignment has been intensively studied and most of the recent methods
have reported close-to-human performance on face images "in the wild".
Here, we look into the mirrorability of face alignment methods and how
their error is correlated with the mirror error.

Experiment setting. For our analysis we focus on the most challenging
dataset collected in the wild, namely 300W. It was created for the
Automatic Facial Landmark Detection in-the-Wild Challenge [15]. To this
end, several popular data sets including LFPW [3], AFW [27] and HELEN
(c) Yang and Ramanan [24] on LSP (d) Wang and Li [20] on LSP

Figure 4: Correlation between the alignment error and the mirror error.
The correlation coefficients are shown above the figures.


[10] were re-annotated with a 68-point mark-up and a new data set, called
iBug, was added. We perform our analysis on a test set that comprises the
test images from HELEN (330 images), LFPW (224 images) and the images in
the iBug subset (135 images), that is 689 images in total. The images in
the iBug subset are extremely challenging due to the large head pose
variations, faces that are partially outside the image and heavy
occlusions. The test images are flipped horizontally to get the mirror
images. We evaluate the performance of several recent state of the art
methods, namely the Supervised Descent Method (SDM) [22], the Robust
Cascaded Pose Regression (RCPR) [4], the Incremental Face Alignment (IFA)
[2] and the Gauss-Newton Deformable Part Model (GN-DPM) [19]. For SDM, IFA
and GN-DPM, only the trained models and the code for testing are
available; we apply those directly on the test images. As stated in the
corresponding papers, IFA and GN-DPM were trained on the 300W dataset and
the SDM model was trained using a much larger dataset. SDM, IFA and GN-DPM
only detect the 49 inner facial points, so our analysis of those methods
is based on those points only. For RCPR, for which the code for training
is available, we retrain the model on the training images of 300W for the
full 68 facial points mark-up. All those methods build on the result of a
face detector. Since most of them are sensitive to initialization, we
carefully choose the right face detector for each one to get the best
performance. More specifically, for IFA and GN-DPM we use the 300W face
bounding boxes and for SDM and RCPR we use the Viola-Jones bounding boxes,
that is, for each method we use the detector that it used during training.
For the methods that use the Viola-Jones bounding boxes, we checked
manually to verify that the detection is correct; for those face images on
which the Viola-Jones face detector fails, we adjust the 300W bounding box
to roughly approximate the Viola-Jones bounding box.

Figure 5: Mirror error and alignment error of RCPR [4] on 300W test images
(x axis: image index). Results are calculated over 68 facial points.

Figure 6: Mirror error and alignment error of GN-DPM [19] on 300W test
images (x axis: image index). Results are calculated over 49 inner facial
points.


Mirrorability. We calculated the mirror error and the alignment error for
each of the 689 test samples in 300W for SDM, IFA, GN-DPM and RCPR. In
Fig. 6 and Fig. 5 we show the errors for two of the algorithms, i.e.,
GN-DPM and RCPR. The former is a representative local-based method and the
latter a representative holistic-based method. Similar results were
obtained for SDM and IFA. In each figure, two pairs of example images are
shown: one with low mirror error (lower left corner) and one with large
mirror error (upper right corner). We sort the sample-wise alignment error
in ascending order and plot it together with the corresponding sample
mirror error. It is clear that although GN-DPM and RCPR work in a very
different way, for both the mirror error tends to increase as the




(c) IFA [2], 49P (d) GN-DPM [19], 49P

Figure 7: Correlation between the alignment error and the mirror error of
various state of the art face alignment methods. The correlation
coefficients are shown above the figures.


alignment error increases. There are a few impulses in the lower range of
the red curve, i.e., samples with low $e_a$ and high $e_m$. This means
that although the algorithm has a small alignment error on the original
samples it has a large error on the mirror images, i.e., ${}^{p}e_a$ is
high. There are three cases that result in high mirror error: 1) low $e_a$
and high ${}^{p}e_a$; 2) high $e_a$ and low ${}^{p}e_a$ (shown in Fig. 5,
upper right corner); 3) high $e_a$ and high ${}^{p}e_a$ (shown in Fig. 6,
upper right corner). Finally, in order to quantify this insight, we
present the correlation between the mirror error and the alignment error
in Fig. 7. For all four methods there is a strong correlation between the
mirror error and the alignment error, with correlation coefficients
ranging from 0.64 to 0.74.

3. Mirrorability Applications 

In the previous sections we have shown that one of the nice properties of
the mirror error is that it is strongly correlated with the object
alignment error, that is with the ground truth error. In this section we
show how it can be used in two practical applications, namely for
selecting difficult samples and for providing feedback in a cascaded face
alignment method.

3.1. Difficult samples selection 

For any computer vision task, including face alignment, it is generally
accepted that some samples are relatively more difficult than others, that
is the error of the algorithm on them is higher. However, it is very
difficult to estimate a measure of how well the algorithm has performed on
a given sample without knowledge of the ground truth. Such a measure would
be very useful, for example in order to select a proper alignment model
for a given dataset or to select which samples to annotate in an Active
Learning scheme. Here, we show how the mirror error can be used for
selecting difficult samples in the problem of face alignment. In order to
do so we apply several methods (IFA, SDM, GN-DPM, RCPR) on the test images
of 300W and get the detection results. Then we sort the normalized mirror
error $e_m$ in descending order and select the first $M$ samples as being
the most difficult ones. We denote this set as $S_{e_m}$.

Figure 8: Consistency measure of 'difficult' samples detection, with
M = 150.

In order to evaluate whether the samples that we have selected in this way
are truly 'difficult' we measure the similarity between the set containing
those $M$ selected samples and the set $S_{e_a}$ that contains the $M$
samples that have the largest alignment error for each method. We use a
measure that we call consistency, which we define as the fraction of the
common samples between the two sets, that is

$$\mathrm{consistency}(S_1, S_2) = \frac{|S_1 \cap S_2|}{M}$$

where $|S_1 \cap S_2|$ is the size of the intersection of $S_1$ and $S_2$.
For each method $i$, we calculate two sets each containing $M$ samples,
i.e., $S^{i}_{e_m}$ and $S^{i}_{e_a}$. We set the value of $M$ to 150. The
chance rate is $M/N$, where $M$ is the number of selected samples and $N$
is the size of the dataset; in our case it is $150/689 \approx 0.22$.
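
The selection and consistency computation can be sketched in a few lines of Python. The sketch assumes that the errors are given as plain lists indexed by sample; the helper names and the five-sample toy errors are hypothetical, while the chance rate uses the values $M = 150$ and $N = 689$ from the text.

```python
def select_difficult(errors, m):
    """Return the indices of the m samples with the largest error."""
    order = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    return set(order[:m])

def consistency(s1, s2):
    """Fraction of common samples between two equally sized sets."""
    if len(s1) != len(s2):
        raise ValueError("both sets must contain the same number of samples")
    return len(s1 & s2) / len(s1)

# Toy errors for five samples (made up for illustration).
mirror_errors = [0.9, 0.1, 0.8, 0.2, 0.7]
alignment_errors = [0.8, 0.7, 0.9, 0.1, 0.2]

s_em = select_difficult(mirror_errors, 3)      # selected without ground truth
s_ea = select_difficult(alignment_errors, 3)   # selected with ground truth
print(consistency(s_em, s_ea))                 # 2 of 3 samples are shared

# Chance rate M / N for the setting used in the text.
chance_rate = 150 / 689
```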

The pairwise consistency rate matrix of $S^{i}_{e_m}$ and $S^{i}_{e_a}$ is
shown in Fig. 8a, where in a certain row we show the consistency between
the $S_{e_a}$ of a certain method and the $S_{e_m}$ of all methods,
including the method itself. Note that the diagonal does not contain ones,
since $S^{i}_{e_m}$ are the $M$ samples with the highest mirror error and
$S^{i}_{e_a}$ the $M$ samples with the highest alignment error. As can be
seen, the consistency values between the two sets of samples for a
specific method (i.e., the diagonal values) are all above 0.7; the highest
is 0.81 for RCPR. More interestingly, the consistency across different
methods, i.e., between the $M$ samples selected according to $e_a$ for a
method in a certain row and the $M$ samples selected according to $e_m$
for a method in a certain column, is high, with values ranging from 0.56
to 0.68. This shows that the samples that we have selected are truly
'difficult', not only for the method employed in the selection process but
also for the other face alignment methods. In other words, this shows that
the methods that we have examined have difficulties with the same images.

Second, we evaluate the consistency across different approaches, i.e., the
consistency of 'difficult' samples found by different approaches. Thus, we
calculate the pairwise consistency of the sets $S^{i}_{e_m}$ of those
methods, as shown in Fig. 8b. The resulting values are clearly much higher
than the chance value of 0.22. In Fig. 8c we depict the 'optimal' case
where the ground truth, that is the alignment error itself, is used to
calculate the pairwise consistency. We observe that the consistency
calculated by our selection process is very close to the one calculated
based on the ground truth. We can further conclude that:

• the difficulty of samples is shared by the different 
methods that we have examined. 

• the difficult samples selected by the mirror error show 
high consistency across different approaches. 

3.2. Feedback on cascaded face alignment 

In recent years cascaded methods like SDM [22], IFA [2], CFAN [25] and
RCPR [4] have shown promising results in face alignment. Although they
differ in terms of the regressor and the features that they use in each
iteration, they all follow the same strategy. The methods start from one
or several initializations of the face shape, which are often calculated
from the face bounding box, and then iteratively refine the estimate of
the face shape by applying at each iteration a regressor that estimates
the update of the shape. These methods are intrinsically sensitive to the
initialization [4, 25]. As stated in [23], only initializations that are
in a range of the optimal shape can converge to the correct solution. To
address this problem, [5] proposed to use several random initializations
and give the final estimate as the median of the solutions to which they
converge. However, having several randomly generated initializations does
not guarantee that the correct solution is reached. The 'smart restart'
proposed in [4] has improved the results to a certain degree. The scheme
starts from different initializations and applies only 10% of the cascade.
Then, the variance between the predictions is checked. If the variance is
below a certain threshold, the remaining 90% of the cascade is applied as
usual. Otherwise the process is restarted with a different set of
initializations.

Here, we propose to use the mirror error as feedback to close this open
cascaded system. More specifically, for a given test image we first create
its mirror image. Then we apply the RCPR model on the original test image
and the mirror image and calculate the mirror error. If the mirror error
is above a threshold we restart the process using different
initializations, otherwise we keep the detection results. This procedure
can be repeated until the mirror error is below the threshold, or until a
maximum number of iterations M is reached. In contrast to the original
RCPR method that keeps only the results from the last set of
initializations, we keep the one that has the smallest mirror error. This
makes sense since new random initializations do not necessarily lead to
better results than past initializations.
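
The feedback loop described above can be summarized by the following Python sketch. It is not the RCPR implementation: `align` stands for any hypothetical aligner that draws its own random initializations and returns a (K, 2) shape, `mirror_error_fn` is a hypothetical function returning the mirror error of a pair of shapes given the image width, and the image is assumed to be a NumPy array with the width on the second axis.

```python
import numpy as np

def align_with_mirror_feedback(image, align, mirror_error_fn,
                               threshold, max_restarts):
    """Restart the alignment until the mirror error is small enough.

    align(image)                      -> (K, 2) shape from new random inits
    mirror_error_fn(shape, shape_mirror, width) -> float
    Returns the shape with the smallest mirror error seen so far and
    that error (not simply the result of the last restart).
    """
    mirrored = image[:, ::-1]            # horizontal flip
    width = image.shape[1]
    best_shape, best_err = None, np.inf
    for _ in range(max_restarts + 1):    # first attempt plus the restarts
        shape = align(image)
        shape_mirror = align(mirrored)
        err = mirror_error_fn(shape, shape_mirror, width)
        if err < best_err:
            best_shape, best_err = shape, err
        if best_err <= threshold:        # feedback: result accepted
            break
    return best_shape, best_err
```

Keeping the best result over all attempts, rather than the last one, reflects the observation in the text that a new random initialization is not guaranteed to improve on an earlier one.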

First we evaluate the effectiveness of our feedback scheme. Ideally, the
restart will be initiated only when the current initialization is unable
to lead to a good solution. Treating it as a two class classification
problem we report results using a precision-recall based evaluation. A
face alignment is considered to belong to the 'good' class if the mean
alignment error is below 10% of the inter-ocular distance; otherwise, it
is considered to belong to the 'bad' class, in which case a restart is
needed. The precision is the number of samples classified correctly as
belonging to the 'bad' (positive) class divided by the total number of
samples that are classified as belonging to the 'bad' class. Recall in
this context is defined as the number of true positives divided by the
total number of samples that belong to the 'bad' class. For a fair
comparison, we adjust our threshold on the mirror error (i.e. the
threshold above which we restart the cascade with a different
initialization) to get a similar recall to that of RCPR with smart restart
[4] using its default parameters. We note that our parameter can also be
optimized by cross validation for better performance. As can be seen in
Fig. 9, at a similar recall level, our proposed scheme has significantly
higher precision (0.65 vs. 0.25) than the RCPR 'smart restart'. This
verifies that our method is more effective in selecting the samples for
which restarting with new initializations is needed.
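
The precision and recall of a restart rule, as defined above, can be computed as in the following Python sketch. It assumes NumPy and per-sample errors already normalized by the inter-ocular distance; the function name, the mirror error threshold and the four toy samples are hypothetical.

```python
import numpy as np

def restart_precision_recall(mirror_errors, alignment_errors,
                             mirror_threshold, bad_threshold=0.10):
    """Precision and recall of flagging 'bad' alignments via the mirror error.

    A sample is predicted 'bad' (restart needed) if its mirror error exceeds
    mirror_threshold; it is truly 'bad' if its mean alignment error is not
    below bad_threshold (10% of the inter-ocular distance).
    """
    predicted_bad = np.asarray(mirror_errors) > mirror_threshold
    truly_bad = np.asarray(alignment_errors) >= bad_threshold
    true_positives = np.sum(predicted_bad & truly_bad)
    precision = (true_positives / predicted_bad.sum()
                 if predicted_bad.any() else float("nan"))
    recall = (true_positives / truly_bad.sum()
              if truly_bad.any() else float("nan"))
    return precision, recall

# Toy values (made up): four samples.
e_m = [0.20, 0.05, 0.30, 0.01]
e_a = [0.15, 0.12, 0.05, 0.02]
print(restart_precision_recall(e_m, e_a, mirror_threshold=0.1))
```

Sweeping `mirror_threshold` traces out the precision-recall trade-off, which is how a threshold with a recall matching the baseline could be chosen.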

Second, we evaluate the improvement in the face alignment that we obtain
using our proposed feedback scheme. We compare to 1) RCPR without restart
(RCPR-0), 2) RCPR with the smart restart of [4] (RCPR-S) and 3) other
state of the art methods. We create two versions of our method. The first
version, RCPR-F1, uses 5 initializations


(a) Original RCPR restart scheme. Precision = 0.25, Recall = 0.63.

Figure 9: Restart scheme of our method vs. RCPR [4] (best viewed in
color).


Methods  RCPR-F2  RCPR-F1  RCPR-S  RCPR-0  SDM   IFA   GN-DPM  CFAN
49P      5.35     6.07     6.59    7.14    7.12  8.31  12.42   7.24
68P      6.25     7.11     7.42    7.73    -     -     -       7.72

Table 4: 49/68 facial landmark mean error comparison.


and at most two restarts, which allows direct comparison to the baseline
method that uses the same number of initializations and restarts. The
second version, RCPR-F2, uses 10 initializations and at most 4 restarts;
this version produces better results and still has good runtime
performance. We compare to SDM [22], IFA [2], GN-DPM [19] and CFAN [25],
all of which have publicly available software and report good results. The
results of the comparison are shown in Table 4. We compare the normalized
alignment error of the common 49 inner facial landmarks for all of these
methods and of the 68 facial landmarks whenever this is possible. On the
challenging 300W test set, with our proposed feedback scheme, the RCPR
method has the best performance compared not only to the original version
of RCPR but also to all the other methods. Although good performance is
obtained on the face alignment problem, we emphasize that the main focus
of this work is to bring attention to the mirrorability of object
localization models.

4. Related Work 

As a method that estimates the quality of the output of a vision system,
our method is related to works like meta-recognition [16], face
recognition score analysis [21] and the recent failure alert [26] for
failure prediction. Our method differs from those works in two prominent
aspects: (1) we focus on the fine-grained object part localization problem
while they focus on instance level recognition or detection; (2) we do not
train any additional models for evaluation while all those methods rely on
meta-systems. In the specific application of evaluating the performance of
human pose estimation, [9] proposed an evaluation algorithm; however, such
an evaluation again requires a meta model and it only works for that
specific application.

Our method is also very different from object/feature detection methods
that exploit mirror symmetry as a constraint in model building [18, 12].
We note that our model does not assume that the detected object or shape
appears symmetrically in an image; such an assumption clearly does not
hold true for the articulated (human body) and deformable (face) objects
that we are dealing with. None of the methods that we have examined in
this paper explicitly uses appearance symmetry in model learning. Our
method only utilizes the mirror symmetry property to map the object parts
between the original and mirror images.

Developing transformation invariant vision systems has drawn much
attention in the last decades. Examples are the rotation invariant face
detection method [14] and the scale invariant feature transform (SIFT)
[11], which efficiently handle several transformations including the
mirror transformation. Recently, Gens and Domingos proposed the Deep
Symmetry Networks [8] that use symmetry groups to represent variations; it
is unclear though how the proposed method can be applied to object part
localization. Szegedy et al. [17] have studied some intriguing properties
of neural networks when dealing with certain artificial perturbations. Our
method focuses on examining the performance of object part localization
methods on one of the simplest transforms, i.e. the mirror transformation,
and drawing useful conclusions.

5. Conclusion and Discussion 

In this work, we have investigated how state of the art object
localization methods behave on mirror images in comparison to how they
behave on the original ones. Surprisingly, all of the methods that we have
evaluated on two representative problems struggle to get mirror symmetric
results despite the fact that they were trained with datasets that were
augmented with the mirror images.

In order to quantitatively analyze their behavior, we introduced the
concept of mirrorability and defined a measure called the mirror error.
Our analysis led to some interesting findings on mirrorability, among
which is a high correlation between the mirror error and the ground truth
error. Further, since the ground truth is not needed to calculate the
mirror error, we showed two applications, namely difficult samples
selection and cascaded face alignment feedback that aids a
re-initialization scheme. We believe there are many other potential
applications, in particular in Active Learning.

The findings of this paper raise several interesting questions. Why do
some methods show better performance in terms of absolute mirror error
(for example, the mirror error of SDM is smaller and that of RCPR is
larger)? Can the design of algorithms with low mirror error lead to
algorithms with good overall performance? We believe these are all
interesting research problems for future work.